The final report for this grant consists of the following three parts:

Part I:   Executive Summary

Part II:  Report on Computer Simulations

Part III: Audio Tape of Simulations

This document includes Part I and Part II, along with a summary description of
the contents of the audio tape. Part II provides a detailed description of the
specific algorithms and parameters employed, including parameter quantization
levels. Also included are the C language programs for the simulations
run on UCLA's MASSCOMP computer.
PART I


EXECUTIVE SUMMARY

Objective


The goal of this contract was to develop a new technique for low cost,
robust voice compression at 4800 bits per second. Our approach was based on
using a cascade of digital biquad adaptive filters with simplified multipulse
excitation followed by simple bit sequence compression.

Initial Results 

Digital biquad adaptive filters are relatively easy to implement and
compare well with the more commonly used LPC filters. This was shown by
Martin and Sun [1,2] of UCLA. Work in this contract applied these biquad
adaptive filter results to voice compression at 4800 bits per second. The
generation of the multipulse excitation was based on combining the well known
(M,L) tree search algorithms [3] with short block compression algorithms.

The work on this contract started with the basic block diagram shown in
Figure 1. Here speech sampled 8,000 times a second with 12 bit quantization
is denoted by s. Eight adaptive biquad filter coefficients k, corresponding to
a cascade of four biquad filters, were computed (using the Martin and Sun
algorithms) and sent to the receiver once every 160 samples. The same
coefficients were used in the speech synthesis model in the (M,L) tree search
algorithm.


The (M,L) tree search algorithm assumes that binary symbols enter the
cascade of four biquad filters at a rate of 8,000 bits per second. After each
bit enters the filter, an estimate of the speech sample exits. The inputs and
outputs of these filters are represented by the binary tree illustrated in
Figure 2. Starting at some initial filter condition, all possible binary inputs
to the cascade of filters and their corresponding outputs are represented by
the tree. Potential estimated speech samples are labeled on each branch of
this tree. Shown below are the actual speech samples, denoted by
$s_1, s_2, \ldots$

[Figure: Model of Speech Process]

The goal of the tree search algorithm is to find the input binary
sequence to the cascade of four biquad filters so that the corresponding
outputs $\hat{s}$ "match" the actual speech $s$ as closely as possible. This work
initially examined the following criteria:

. mean square error, $(s-\hat{s})^2$

. magnitude, $|s-\hat{s}|$

. third power magnitude, $|s-\hat{s}|^3$

. fourth power, $|s-\hat{s}|^4$

Subjective listening to compressed speech for each of these criteria showed
that the fourth power was slightly better than the first power, third power,
and the mean squared error. Differences between these criteria were small.
Finding the "best" binary sequence amounts to searching all possible paths in
the representation tree and comparing each path output sequence with the
actual speech sequence using some criterion as given above. Since the number
of paths grows exponentially with the number of tree branches (depth of the
tree), a more practical tree search approach is required. Also, because there
were only small differences among the above criteria, we selected the mean
square error criterion for the remainder of this work.

The (M,L) tree search algorithm is a suboptimum tree search algorithm
that keeps track of only M survivor paths of L branches in length at any given
time. It also requires that all survivor paths originate from the same node L
branches from the end. At a rate of 8,000 times a second in the (M,L) tree
search algorithm, each of at most M surviving paths is extended by one branch,
forming temporarily up to 2M paths. The single best path for L+1 branches is
computed and its initial leftmost branch is chosen. Among the M-1 next
best paths, only those following this leftmost branch are chosen as
survivors along with the best path. Thus there is a delay of at least L sample
times in the (M,L) tree search algorithm. Binary path decisions are made on
the basis of examining at most M most likely candidate paths of length L at
any given time.

Figure 3 illustrates an example of an M = 4 and L = 3 tree search
algorithm. Beginning at the starting node, all paths for L = 3 branches are
considered. Only the top M = 4 of the 8 possible paths are selected. The end
nodes of these surviving paths are circled, with the single best path shown
with a solid circle. Next, only those surviving paths on the same half of the
tree as the one best path are extended by one branch. Among these 6 paths only
the top M = 4 are selected as survivors, with the best path again shown with a
solid circle on the end node. Now only those surviving paths sharing a common
node L = 3 branches back with the best path are extended. This process
results in a path sequence being selected.
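The survivor bookkeeping of the (M,L) search can be sketched in C. This is a minimal illustration under stated assumptions, not the simulation program of this report: a one-pole filter (`synth_step`) stands in for the cascade of four biquads, the squared error is the path metric, and the names `Synth`, `Path`, `ml_search`, `ML_MAX_M`, and `ML_MAX_L` are hypothetical.

```c
#include <stddef.h>

#define ML_MAX_M 32  /* largest M this sketch supports */
#define ML_MAX_L 32  /* largest L this sketch supports */

/* Stand-in synthesis filter (an assumption, not the report's biquad
   cascade): y(n) = x(n) + a*y(n-1). */
typedef struct { double y1; } Synth;

double synth_step(Synth *f, double a, double x)
{
    f->y1 = x + a * f->y1;
    return f->y1;
}

typedef struct {
    Synth f;                          /* filter memory at the path end    */
    double err;                       /* accumulated squared error        */
    unsigned char bits[ML_MAX_L + 1]; /* undecided branches, oldest first */
} Path;

/* Encode n samples s[] into n bits out[] (1 -> +c, 0 -> -c).
   Requires 1 <= M <= ML_MAX_M and 1 <= L <= ML_MAX_L. */
void ml_search(const double *s, size_t n, int M, int L,
               double a, double c, unsigned char *out)
{
    Path surv[ML_MAX_M], cand[2 * ML_MAX_M];
    Path init = { {0.0}, 0.0, {0} };
    int nsurv = 1, depth = 0;
    size_t nout = 0;

    surv[0] = init;
    for (size_t t = 0; t < n; t++) {
        int ncand = 0;

        /* Extend every survivor by both branches: up to 2M candidates. */
        for (int p = 0; p < nsurv; p++) {
            for (int b = 0; b < 2; b++) {
                Path q = surv[p];
                double e = s[t] - synth_step(&q.f, a, b ? c : -c);
                q.err += e * e;
                q.bits[depth] = (unsigned char)b;
                cand[ncand++] = q;
            }
        }
        depth++;

        /* Sort candidates by error (insertion sort; ncand <= 2M). */
        for (int i = 1; i < ncand; i++) {
            Path key = cand[i];
            int j = i - 1;
            while (j >= 0 && cand[j].err > key.err) {
                cand[j + 1] = cand[j];
                j--;
            }
            cand[j + 1] = key;
        }

        /* With L+1 branches pending, release the oldest branch of the
           best path and drop candidates that disagree with it. */
        if (depth > L) {
            unsigned char rel = cand[0].bits[0];
            int kept = 0;
            out[nout++] = rel;
            for (int i = 0; i < ncand; i++) {
                if (cand[i].bits[0] != rel) continue;
                Path q = cand[i];
                for (int k = 1; k < depth; k++) q.bits[k - 1] = q.bits[k];
                cand[kept++] = q;
            }
            ncand = kept;
            depth--;
        }

        /* Keep at most M survivors, best first. */
        nsurv = ncand < M ? ncand : M;
        for (int i = 0; i < nsurv; i++) surv[i] = cand[i];
    }

    /* Flush the undecided branches of the best remaining path. */
    for (int k = 0; k < depth; k++) out[nout++] = surv[0].bits[k];
}
```

When the target is generated by the same filter from a known bit sequence, the true path accumulates zero error and is always the best survivor, so the search returns that sequence. With real speech the released bits depend on M and L, which is why several values of each were tried.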

The (M,L) tree search algorithm was investigated for values of M =
2, 4, 8, 16, 32 and values of L = 8, 16, 32. The resulting binary sequence r
represented the binary sequence into the receiver's cascade of biquad filters
that results in an output sequence that is "close" to the actual speech.

Up to this point we had a 9600 bits per second voice compression system:
8000 bits per second of excitation and 1600 bits per second for parameters.
M = 8 and L = 32 was adequate, but the compressed speech sounded rather noisy
and there were occasional distortions.

[Figure 3. Full Tree M=4, L=3]


To reduce the data rate to 5600 bits per second, the next step was to do
a 2:1 compression of r. We found that any compression algorithm that used
more memory or large block codes tended to sound worse than simpler,
shorter codes. In general, any compression algorithm has the same effect as
transmission errors on the uncompressed sequence, and short codes tended to
"localize" this error.

Using various ad hoc simple short block codes for data compression, and
reducing the parameter quantization to 800 bits per second, we found that the
4800 bps speech was much noisier than the 9600 bps speech. This was
expected. However, the resulting speech had a natural sounding quality to it
compared to conventional 4800 bps LPC speech compression. The conclusion was
that conventional LPC speech was relatively noise free but the speech itself
had an "electronic accent." Our approach resulted in natural sounding speech
but with considerable background noise. This was where we were at the end of
the first three month period of this contract.

Punctured Tree Search Algorithms

During the first three months we discovered the now obvious result that
better overall performance could be achieved if the (M,L) tree search took
into account the impact of data compression. This led to the concept of
punctured tree search algorithms that combine tree search and data compression
into a single algorithm. This algorithm turns out to be the natural source
coding dual to punctured convolutional codes used in channel coding [4].
Hence we call these algorithms punctured (M,L) tree search algorithms.

The new system is sketched in Figure 4. To illustrate the punctured
tree algorithm, consider the example illustrated in Figure 5, where we choose M
= 4 and L = 5. Here we assume initially that every other bit transmitted over
the channel is now eliminated. This results in a 2:1 compression. The
punctured tree algorithm takes this into account by constructing a new tree, shown
in Figure 5, where there is only one branch leaving each node corresponding to
those cases where nothing enters the receiver's biquad filters. Essentially
the same basic (M,L) algorithm is used, except now the tree diagram that models
the receiver's speech generation process is modified by the various data
compression algorithms.

In this research various punctured tree search algorithms were examined. 
To achieve 4000 bps for the residual, we first tried eliminating half the 
transmitted bits in a binary transmitted sequence. This is essentially the 
type shown in Figure 5. Another example of 4000 bps is to send two bits (one
of four amplitudes) once every four sample times. This results in a
punctured tree with a repeated pattern of one branch leaving each node for
three nodes followed by four branches leaving the next node.
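The repeated branch pattern of a punctured tree can be written down directly. The sketch below assumes, for illustration only, that the b transmitted bits fall on the last sample of each group of m samples; the function name `branches_at` is hypothetical.

```c
/* Number of branches leaving the tree node at sample index t when b bits
   are sent once every m samples. Assumption of this sketch: the
   transmitted sample is the last one in each group of m. */
unsigned branches_at(unsigned t, unsigned b, unsigned m)
{
    return (t % m == m - 1) ? (1u << b) : 1u;
}
```

For b = 2 and m = 4 this gives the pattern 1, 1, 1, 4 described above. For b = 1 and m = 2 it alternates between one and two branches, which corresponds to eliminating every other bit.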

Using the punctured tree algorithms, we obtained better compressed
speech quality. There seemed, however, to be a limit on further improvement due to
some instabilities of the adaptive algorithm for finding biquad filter
coefficients.

Stabilizing the Adaptive Biquad Filter Algorithms
The adaptive biquad filter algorithm of Martin and Sun [1,2] has the
form

$$K(n+1) = K(n) - u\,S(n)\,d(n)$$

where

d(n) = residual signal at time n

$S(n) = \partial d(n) / \partial K(n)$

u = positive constant.

This is a gradient tracking method. Here K(n) is a typical filter
coefficient. There are two such coefficients for each of the four biquads used in this
study.


Occasionally we observed instabilities in the adaptive algorithm and
tried the modifications

$$K(n+1) = K(n) - u\,\mathrm{sgn}[S(n)]\,d(n)$$

and

$$K(n+1) = K(n) - u\,\mathrm{sgn}[S(n)\,d(n)].$$

The first has the advantage of a small step size when d(n) is close to zero,
while the second limits the maximum step size. The best compressed
speech was obtained by using both of these in the form

$$K(n+1) = \begin{cases} K(n) - u_1\,\mathrm{sgn}[S(n)\,d(n)] & \text{if } |d(n)| \ge \gamma \\ K(n) - u_2\,\mathrm{sgn}[S(n)]\,d(n) & \text{if } |d(n)| < \gamma \end{cases}$$

This requires careful selection of the parameters $u_1$, $u_2$, and $\gamma$.

Unfortunately, good choices of the parameters $u_1$, $u_2$, and $\gamma$ varied with the
speech samples used in our tests. In particular, good choices of these
parameters depended on the sampled speech power level. This led to the root mean
square (rms) normalization scheme illustrated in Figure 6. In addition to
this normalization, we found that stability of the adaptive algorithm was
established for all our sampled speech by clipping very large speech samples after
the normalization. The clipping threshold introduced another parameter to be
selected for the 4800 bps voice compression system.



Conclusions 


The voice compression system shown in Figure 4, together with the rms
normalization and clipping process shown in Figure 6, is the final 4800 bps
voice compression system that evolved in this contract research. Our estimate
of the required computation speeds indicates that this voice compression system
can be implemented on a single IBM PC board using two Texas Instruments
TMS32010 digital signal processor chips. A general control processor
chip such as a Motorola 68000 may also be needed.

The simulation results at 4800 bps had very natural sounding speech
compared to LPC synthesis techniques. The speech has, however, much more quantization
noise. To test the robustness of the system we considered voice with
background interference. Since this system is basically a waveform tracking
approach, it is, as expected, very robust to background interference. This may
be the system's most important property.

This work represents an initial investigation of the application of two
new concepts in voice compression:

1. Biquad Filters

2. Punctured Tree Search

In the 9 month contract period we feel that we have illustrated the practical
feasibility of these new concepts, and we recommend that further work be conducted
on this system. Specifically, we recommend developing a single board
prototype implementation of the system for further testing. For mobile
satellite service applications, where robustness is important, the 4800 bps voice
compression system developed here appears to be a good candidate. More work,
however, is required. Our contract research work was limited by slow
processing: two seconds of speech took approximately 20 minutes on the
time shared MASSCOMP computer. This made it difficult to do more extensive
testing of the many variations of the system.
PART II


REPORT ON SIMULATION 
OF 

LOW COST VOICE COMPRESSION FOR MOBILE DIGITAL RADIOS 
1. INTRODUCTION


This report presents details of the computer simulations of the
voice compression system. It is assumed that readers understand the
basic performance characteristics of the digital biquad filter and the
conventional (M,L) tree encoding scheme [1,2,3,4].

The simulations were conducted on the MASSCOMP computer system,
using the C programming language. The voice compression
system has two types of information to be transmitted through a noisy
channel to a receiver: the residual signal, and parameters representing
the biquad coefficients and two root-mean-square (r.m.s.) values.
Based on the transmission rate for the information on the residual
signal, the voice compression system is called sys-16k, sys-12k,
sys-8k, or sys-4k, where the numerical digits denote the residual signal
transmission rate. For example, sys-8k needs 8000 bits per second
for the residual signal. For sys-16k, sys-12k, and sys-8k an additional
1600 bits per second was used for transmitting system parameters,
resulting in total data rates of 17.6 kbps, 13.6 kbps, and 9.6 kbps
respectively. For sys-4k, both 1600 bps and 800 bps were tried for the system
parameters. There was little difference in subjective speech quality
between these two cases in sys-4k, so 800 bps was used in the final
4800 bps system.

We divide the voice compression system into five subsystems: speech
source, input normalization, speech analysis, tree search algorithm,
and speech reconstruction. The block diagram of the system is
sketched in Figure 1.

[Figure 1: Block diagram of the system. Signal labels: digital speech, normalized speech, biquad coefficients, binary bits.]

The following sections examine the simulation behavior of the
individual subsystems in detail. Signal processing is based on a
frame, where the frame length is usually 20 ms. Thus, the time
index n denotes the n-th sample of the current frame.

2. SPEECH SOURCE 

There are two different types of speech files in the MASSCOMP
computer system. The first type is the original test set of 16-bit
quantized speech sampled at 8000 times per second, which was obtained from
Professor Tom Barnwell of Georgia Tech. Since the A/D and D/A
converters of the MASSCOMP can handle only 12-bit quantized samples, to
convert this digital speech into an analog signal, the digital samples
of this original test file must be divided by a number higher than $2^4$
before entering the D/A converter. Appendix I describes this
original test set of quantized speech, used for most of this contract
work.

The second type is the set of 12-bit quantized speech that we
generated ourselves at various sampling rates. The generating process is
illustrated in Figure 2.

[Figure 2: Generating process of a speech file. Labels: inside of the MASSCOMP; select a sampling rate.]

We first record a segment of voice on a cassette tape, where the
maximum number of samples for the MASSCOMP is 32000, i.e. 4 seconds
at the sampling rate of 8000 per second. When we replay the segment
through the low-pass filter into the MASSCOMP, we can choose a specific
sampling rate by modifying an integer in the computer command
statement. Since the A/D converter of the MASSCOMP was used, these
speech files are 12-bit quantized samples.

The low-pass filter in Figure 2 is an active filter using switched
capacitors. The bandwidth is controlled by the selection of the clock
frequency. The clock oscillator operates the switched capacitors. For
a specific sampling rate and bandwidth, a very narrow-band tone is
generated and added to the original segment of voice. This tone is
caused by subharmonic components of the clock signal. One way to
reduce such undesired noise is to change the clock frequency until
the noise meets a desired level. Because of this undesirable tone due
to our active filter, we used a relatively wide front end bandwidth.
Thus, the actual bandwidth we had for the second type of files is much
wider than 4.4 kHz, where 4.4 kHz is the controllable minimum bandwidth
of the low-pass filter.

If any amplitude of the signal out of the low-pass filter is greater
than a voltage level of $2^{12}$, the A/D converter changes it to zero
instead of truncating it to $2^{12}$. This is also true for negative
amplitudes. Thus, a voice segment should be recorded on a cassette tape
so that the amplitudes from the low-pass filter range between
$-2^{12}$ and $2^{12}$. Otherwise, the resulting quantized speech has large
discontinuities. We took care of the above problems when we generated
the speech files. The noise in these speech files is negligible.

3. INPUT NORMALIZATION 

It has been observed that the adaptive estimator for biquad
coefficients works well when input amplitudes entering the inverse biquad
filter (speech analysis filter) are less than a voltage level of about
1.5. Under this condition, a good value of the gain factor in the
recursive update formula is u = 0.0625. If either the input voltage
amplitude is much greater than 1.5 or u >> 0.0625, the voice
compression system can become unstable, in which case our computer simulation
stops. To eliminate instability and to achieve a better estimation of
the adaptive biquad coefficients, we need to use normalization prior
to the speech analysis.

We compute the r.m.s. of input samples by $\sqrt{\sum s^2(n) / \mathrm{FRAME}}$, where
FRAME = frame length, the summation is over the frame, and s(n) is
the n-th input speech sample of the current frame. The modified
r.m.s. of the frame with a dc-bias has the form

$$\mathrm{rms1} = \beta_1 \sqrt{\sum s^2(n) / \mathrm{FRAME} + \beta_2} \tag{1}$$

The purpose of $\beta_2$ is to avoid the case that rms1 is equal to or close
to zero. Our choice is

$$\beta_1 = 2.0 \text{ and } \beta_2 = 0.1 \tag{2}$$

Different values of $\beta_1$ and $\beta_2$ do not make much of a difference in
subjective speech quality, while the value of u should be chosen to
obtain good quality. Thus, we fixed these values throughout the
simulation, and optimized other system parameters.


[Figure 3: Block diagram of the normalization. Blocks: delay of one frame; computation of rms1 and interpolation; output s(n)/rms1(n).]





The block diagram of the normalization is sketched in figure 3.
Other more easily implemented forms of normalization were not examined
here. Linear interpolation of rms1 is employed to avoid an abrupt
change of envelope over the junction between two frames. Let rms1(n) be
the interpolated rms1 of the n-th component in the current frame. It
is given by

$$\mathrm{rms1}(n) = \mathrm{rms1p} + (n+1)\,\frac{\mathrm{rms1c} - \mathrm{rms1p}}{\mathrm{FRAME}} \tag{3}$$

where rms1c and rms1p are the r.m.s.'s of the current and the previous
frames respectively. In the simulations, we implement this in the
recursive form of (4),

$$\mathrm{rms1}(n) = \mathrm{rms1}(n-1) + \Delta \tag{4}$$

where $\Delta = (\mathrm{rms1c} - \mathrm{rms1p}) / \mathrm{FRAME}$, rms1(-1) = rms1p, and n = 0, 1, ..., FRAME-1.
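The recursive form of the interpolation can be sketched as a short C routine. This is an illustration only, not the report's simulation code; the function name `interpolate_rms1` and its argument layout are hypothetical.

```c
#include <stddef.h>

/* Fill out[0..frame-1] with the interpolated rms1 values of one frame,
   using the recursive form: rms1(n) = rms1(n-1) + delta. */
void interpolate_rms1(double rms1p, double rms1c, size_t frame, double *out)
{
    double delta = (rms1c - rms1p) / (double)frame;
    double value = rms1p;            /* plays the role of rms1(-1) */

    for (size_t n = 0; n < frame; n++) {
        value += delta;
        out[n] = value;
    }
}
```

The last value of the frame equals the current-frame r.m.s., so the next frame continues from it without a jump. The running sum trades one multiplication per sample for a small accumulation of rounding error over the frame.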

We compared two cases in quality: with and without the interpolation.
The case without the interpolation sounds like discontinuous
voice, while with interpolation there is no noticeable discontinuity.
A simple graphic illustration is shown in figure 4, where FRAME = 10
and rms1p = 3 are assumed. Notice that the envelope without the
interpolation has a different shape.

We also observed occasional instability in the voice compression
simulations when employing the normalization. This means that there
are still some normalized voltage amplitudes that are larger than
1.5. To limit them to some level around 1.5, we add the clipping device
shown in figure 5. With this clipping we had no instability in any of
our tests.



Figure 5: Clipping device in the normalization 

Let $\tilde{s}(n)$ be the final output from the normalization and clipping,
which is given by

$$\tilde{s}(n) = \begin{cases} \pm\delta, & \text{if } |s(n)/\mathrm{rms1}(n)| > \delta \\ s(n)/\mathrm{rms1}(n), & \text{otherwise} \end{cases} \tag{5}$$

where the clipped value takes the sign of s(n). As the value of $\delta$ varies, we have different voice qualities. One
good choice of $\delta$ is around 1.0. A rough indication of voice
quality with respect to $\delta$ is as follows.

. $\delta$ below about 1.0: smooth sound, and hissing noise in the background.

. $\delta$ above about 1.0: clear and discontinuous sound, and instability in the high region.

If we double $\beta_1$, $\delta$ should be reduced by one half. However, both
cases provide a similar voice quality.
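The normalization and clipping of a single sample can be sketched as follows. This is a minimal illustration that assumes the clipped value keeps the sign of the input sample; the function name `normalize_and_clip` is hypothetical.

```c
/* Divide a sample by the interpolated rms1 and limit the result to the
   range [-delta, +delta]. Assumption of this sketch: a clipped sample
   keeps the sign of the input. */
double normalize_and_clip(double s, double rms1, double delta)
{
    double x = s / rms1;

    if (x > delta) return delta;
    if (x < -delta) return -delta;
    return x;
}
```

Because the sample is divided by rms1 before the comparison, scaling rms1 by two halves every normalized amplitude, which is why the clipping level has to be halved along with it to keep the same behavior.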

4. SPEECH ANALYSIS 

The inverse biquad filter analyzes a speech spectrum over 20 ms.
The filter consists of four inverse biquads in cascade. The i-th
inverse biquad estimates the i-th formant frequency $f_i$ and the sharpness $Q_i$
of the spectrum envelope around $f_i$. Finally, it notches out the input
spectrum in the sense of minimizing the residual power. Let $k_1$ and $k_2$
be the coefficients of the i-th biquad. The relationship between
$(f_i, Q_i)$ and $(k_1, k_2)$ is roughly

$$f_i \approx k_1 f_s / 2\pi \tag{6}$$

where $f_s$ is the sampling frequency [Hz], and

$$Q_i = 1/k_2 \tag{7}$$

The block diagram of the speech analysis system is sketched in figure
6. The transfer function of each inverse biquad is

$$H'_i(Z) = k_1^{-2}\,[\,1 - (2 - k_1 k_2 - k_1^2)\,Z^{-1} + (1 - k_1 k_2)\,Z^{-2}\,] \tag{8}$$

where each biquad has a different pair $(k_1, k_2)$.
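The inverse biquad and the matching synthesis biquad can be sketched as difference equations. The sketch reads equation (8) as a gain of $k_1^{-2}$ times $1 - c_1 Z^{-1} + c_2 Z^{-2}$ with $c_1 = 2 - k_1 k_2 - k_1^2$ and $c_2 = 1 - k_1 k_2$, and takes the synthesis biquad as the all-pole filter with the same $c_1$ and $c_2$; that reading, and the names `Biquad`, `biquad_init`, `inverse_biquad_step`, and `synthesis_biquad_step`, are assumptions of this illustration.

```c
/* One biquad section. Coefficients follow one reading of the transfer
   functions in the text: c1 = 2 - k1*k2 - k1^2, c2 = 1 - k1*k2. */
typedef struct {
    double c1, c2;   /* Z^-1 and Z^-2 coefficients          */
    double gain;     /* k1^-2, gain of the inverse biquad   */
    double m1, m2;   /* two delayed samples (filter memory) */
} Biquad;

void biquad_init(Biquad *q, double k1, double k2)
{
    q->c1 = 2.0 - k1 * k2 - k1 * k1;
    q->c2 = 1.0 - k1 * k2;
    q->gain = 1.0 / (k1 * k1);
    q->m1 = q->m2 = 0.0;
}

/* Inverse (analysis) biquad: r(n) = gain*[x(n) - c1*x(n-1) + c2*x(n-2)]. */
double inverse_biquad_step(Biquad *q, double x)
{
    double r = q->gain * (x - q->c1 * q->m1 + q->c2 * q->m2);
    q->m2 = q->m1;
    q->m1 = x;
    return r;
}

/* Synthesis biquad: y(n) = x(n) + c1*y(n-1) - c2*y(n-2). */
double synthesis_biquad_step(Biquad *q, double x)
{
    double y = x + q->c1 * q->m1 - q->c2 * q->m2;
    q->m2 = q->m1;
    q->m1 = y;
    return y;
}
```

Feeding the inverse biquad output, divided by its gain, into the synthesis biquad with the same coefficients returns the original input. A cascade of four sections would call the step function of each section in turn.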



[Figure 6: Block diagram of the inverse biquad filter. Input: normalized speech s(n); output: residual signal.]


The main problems of this subsystem are how to build a recursive
update algorithm to accurately estimate the coefficients $k_1$ and $k_2$,
and how to establish stability in the algorithm. It was observed that
with a large value of u ($\gg 0.0625$) the simulation program can stop
due to instabilities.

4.1 Recursive update algorithm 

A simplified gradient expression of the recursive update algorithm
is

$$k_i(n+1) = k_i(n) - u\,s_i(n)\,r(n), \quad i = 1, 2 \tag{9}$$

where r(n) is the output of the inverse biquad filter, and

$$s_i(n) = \frac{\partial r(n)}{\partial k_i} \quad \text{(sensitivity term)} \tag{10}$$

where $s_i(n)$ can be implemented by a second-order filter. The meaning
of $2 s_i(n) r(n)$ is, in fact, the slope of $r^2(n)$ with respect to $k_i(n)$,
i.e.

$$\frac{\partial r^2(n)}{\partial k_i(n)} = 2\,r(n)\,\frac{\partial r(n)}{\partial k_i(n)} = 2\,s_i(n)\,r(n) \tag{11}$$

We can control the tracking speed by the gain factor u. It also
should be noticed that the update size $\Delta k_i(n) = k_i(n+1) - k_i(n)$
depends heavily upon the input power entering the inverse biquad filter.
The gradient update formula is illustrated in Figure 7, where $k_i^*(n)$
is the minimizing coefficient.

Recall that we already used a clipping device in the input
normalization to avoid any high amplitudes. Even though the input levels are
limited, the final residual signal r(n) is sometimes too large to keep
a desired update size for a good estimation. To keep a robust update
size, we employ a clipping device as shown in figure 8.



Figure 8: Clipping in the recursive update algorithm 

Thus, our final version can be given by (12).

$$k_i(n+1) = \begin{cases} k_i(n) - u_c\,\mathrm{sign}[s_i(n)\,r(n)], & \text{if } |r(n)| > \gamma \\ k_i(n) - u\,\mathrm{sign}[s_i(n)]\,r(n), & \text{otherwise.} \end{cases} \tag{12}$$

A careful choice of $u_c$ and $\gamma$ is required because they have an effect
on quality and instability. According to our tests, a good choice is
$u_c = 0.018$ and $\gamma$ around 0.5, where $\delta = 1.0$ (see figure 5). Using
these values, we have not yet found an unstable case, and we can
obtain good quality compressed voice. However, quality varies as u
changes. Figure 9 demonstrates the rough behavior of quality. Our
simulations use a single value of u for all four biquads. We tested
some different combinations, but they did not make much of a difference
in quality.
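The clipped update rule can be sketched as a single C function. This is an illustration under assumptions: the two step sizes are passed in as `u` and `u_c`, the threshold as `gamma`, and the sensitivity term `s` is taken as already computed; the names `sign_of` and `update_coeff` are hypothetical.

```c
/* Returns -1, 0, or +1 according to the sign of v. */
double sign_of(double v)
{
    return (double)((v > 0.0) - (v < 0.0));
}

/* One update of a biquad coefficient k. For a large residual the step
   has fixed size u_c; for a small residual it shrinks with |r|. */
double update_coeff(double k, double s, double r,
                    double u, double u_c, double gamma)
{
    double abs_r = r < 0.0 ? -r : r;

    if (abs_r > gamma)
        return k - u_c * sign_of(s * r);   /* step size is at most u_c */
    return k - u * sign_of(s) * r;         /* step is u*|r|            */
}
```

The large-residual branch bounds the step regardless of how large r(n) becomes, while the small-residual branch lets the coefficient settle as r(n) approaches zero. In both branches the coefficient moves against the sign of the product of the sensitivity and the residual.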


[Figure 9: Quality vs. u, where $\gamma = 0.5$ and $u_c = u\gamma$. Below about u = 0.02: hissing noise in background. Between about 0.02 and 0.03: proper region. Above about 0.03: clear but clicking sound.]


It is necessary to compute the average of $k_i$ over each frame, which
is then transmitted over the channel. We use the average value

$$\bar{k}_i = \frac{\sum k_i(n)}{\mathrm{FRAME}} \tag{13}$$

where the summation is over the frame. This requires a frame delay. The
averaged $k_i$ is finally tested by a stability checker to determine
whether or not the biquad $H_i(Z)$ (not $H'_i(Z)$) is stable. The next
subsections consider the stability checker and the quantization
process of the biquad coefficients.


4.2 Stability check of H(Z). 

The transfer function of the biquad in the speech synthesis is

$$H_i(Z) = 1 / [\,1 - (2 - k_1 k_2 - k_1^2)\,Z^{-1} + (1 - k_1 k_2)\,Z^{-2}\,] \tag{14}$$

The necessary and sufficient condition for the stability of $H_i(Z)$ is

$$k_1 k_2 > 0 \quad \text{and} \quad \frac{k_1^2}{2} + k_1 k_2 < 2 \tag{15}$$

If the output $(k_1(n), k_2(n))$ of the recursive update algorithm
violates the constraint of (15), we make a modification. Instead of
checking (15) directly, we use a look-up table. First we set the
lower and upper bounds of formant frequencies of most practical
utterances. Using the relationship between formant frequency and $k_1$ (see (6)),
we can compute the corresponding bounds of the $k_1$'s. Next the bounds
of the $k_2$'s are calculated by (15). Table 1 shows the bounds.

For example, suppose that we have $k_1(n) = 0.215$, u = 0.025, and $s_1(n) > 0$.
Then $k_1(n+1) = k_1(n) - u\,\mathrm{sign}(s_1(n))\,r(n) = 0.19$. Since
$k_1(n+1) = 0.19$ is lower than the bound, we set $k_1(n+1) = 0.2$. Because every
update is held within the bounds, the averages of the $k_i$'s are also in the stable region.
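The stability test and the bound check can be sketched in C. The condition used here is the standard second-order stability region applied to the denominator $1 - (2 - k_1 k_2 - k_1^2) Z^{-1} + (1 - k_1 k_2) Z^{-2}$, which is one reading of the synthesis transfer function above; the function names `biquad_is_stable` and `clamp_to_bounds` are hypothetical.

```c
/* Direct stability test for one synthesis biquad. Assumes the
   denominator 1 - (2 - k1*k2 - k1^2) Z^-1 + (1 - k1*k2) Z^-2, for which
   both poles lie inside the unit circle exactly when these hold. */
int biquad_is_stable(double k1, double k2)
{
    return k1 * k2 > 0.0 && 0.5 * k1 * k1 + k1 * k2 < 2.0;
}

/* Table look-up style correction: hold an updated coefficient inside
   its precomputed lower and upper bounds. */
double clamp_to_bounds(double k, double lower, double upper)
{
    if (k < lower) return lower;
    if (k > upper) return upper;
    return k;
}
```

With the bounds used in the worked example, an updated value of 0.19 is raised to the lower bound 0.2. Clamping against a table avoids evaluating the products in the direct test at every sample.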

4.3 Quantization of biquad coefficients 
Table 1. Lower and upper bounds of $k_1$ and $k_2$ (entries marked "?" are illegible in the source)

          1-st biquad      2-nd biquad      3-rd biquad      4-th biquad
          Lower   Upper    Lower   Upper    Lower   Upper    Lower   Upper
  k_1     ?       0.8      ?       1.15     ?       1.64     ?       1.87
  k_2     ?       1.312    0.05    1.16     0.01    0.398    0.01    0.13



Our voice compression system allocates 1350 bits per second (27 bits
per 20 ms) to transmit the 8 biquad coefficients. Appendix 2 gives a
procedure for deriving an optimal bit allocation scheme for our
system. Based on this analysis, table 2 shows the bit allocation we use.


Table 2. Bit allocation for biquad coefficients

          1-st biquad   2-nd biquad   3-rd biquad   4-th biquad
  k_1     4             4             4             4
  k_2     3             3             3             2
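The arithmetic behind the coefficient bit rate can be checked with a few lines of C. This is a sketch for illustration; the function names `total_bits` and `bits_per_second` are hypothetical, and the allocation array is assumed to hold the per-coefficient bit counts of the four biquads.

```c
/* Sum of the bits assigned to the individual coefficients. */
int total_bits(const int *alloc, int count)
{
    int sum = 0;

    for (int i = 0; i < count; i++)
        sum += alloc[i];
    return sum;
}

/* Bit rate that results from sending bits_per_frame once per frame. */
int bits_per_second(int bits_per_frame, int frame_ms)
{
    return bits_per_frame * 1000 / frame_ms;
}
```

An allocation of 4 bits for each of the four $k_1$ values and 3, 3, 3, and 2 bits for the $k_2$ values sums to 27 bits per frame, which at one frame per 20 ms gives 1350 bits per second.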


We compared simulation results for two cases, with and without the
quantization, where the optimized parameters of the two cases are different.
There was no apparent degradation for sys-12k, sys-8k and sys-4k.
Since sys-16k provides very good quality, there was some noticeable
degradation caused by this quantization.

Three places in the voice compression system use the quantized
coefficients: the computation of rms2 (to be discussed), the biquad filter of
the (M,L) tree search algorithm, and the speech reconstruction, where
linear interpolation is applied to the quantized biquad coefficients.
The interpolation is

$$k_i(n) = k_i(-1) + (n+1)\,\frac{k_i(\mathrm{FRAME}) - k_i(-1)}{\mathrm{FRAME}} \tag{16}$$

where

$k_i(-1)$ = quantized coefficient for the previous frame,

$k_i(\mathrm{FRAME})$ = quantized coefficient for the current frame,

and n = 0, 1, 2, ..., FRAME - 1. We implement the interpolation
by using the same simple version as in the input normalization. When
we did not apply the interpolation, we had some clicking sound.

Appendix 3 lists the quantized biquad coefficients of sys-4k.
Notice that all the values meet the constraint for stability.



5. TREE SEARCH ALGORITHM 

The (M,L) tree search algorithm searches through branches of the
tree populated with outputs from the biquad filter. It searches for the
best input sequence of digits so that the corresponding outputs
provide minimum distortion with respect to the original speech. The best
sequence of input digits is encoded into a binary sequence, which is then
sent through a noisy channel. The transmission rate is determined
by both the encoding scheme and the populating method. The biquad
filter here consists of four biquads in cascade.

Our simulation used $(s(n) - \hat{s}(n))^2$ as the distortion criterion,
where s(n) is the original sample and $\hat{s}(n)$ is the corresponding output
of the biquad filter. Other alternatives are $|s(n) - \hat{s}(n)|$ and
$|s(n) - \hat{s}(n)|^p$, p > 2, etc. When we compared the squared error and
the absolute error, we felt that the squared error criterion was
slightly better.

If there is no restriction on the transmission rate, we can use
enough bits to accurately represent residual samples from speech
analysis, and send them to a receiver that can recover the original
speech samples. Suppose that the sampling rate is 8000 samples per second
and the transmission rate is 8000 bits per second, where we represent each
residual sample by either +1 or -1. In this case, we actually generate a constant c so
that either +c or -c enters the biquad filter. We call the constant the
exciting reference, denoted by rms2. It is desired to generate rms2 so
that the outputs of the biquad filter are close to the original ones.

5.1 Exciting reference and Multi-level assignment 
One natural choice of rms2 is the r.m.s. of the residual signal
from the inverse biquad filter. The computation of rms2 is illustrated
in figure 10. It should be noticed that the inverse biquad is
identical to the inverse of the biquad except for the dc gain. Thus, there is a
division of the residual output by $\prod \bar{k}_{1,i}^{-2}(n)$, where the product is
from i=1 to i=4, and $\bar{k}_{1,i}(n)$ is the interpolated $k_1(n)$ of the i-th
biquad.



Figure 10: Computation of rms2 

Let b be the number of bits representing a residual sample. If the
sampling rate is 8000 times per second, then the transmission rate is
8000 b bits/sec. One of $2^b$ amplitudes can be transmitted in binary
form. For the case of b = 2, the actual amplitudes entering the
biquad filter are denoted by

$$-a_2\,\mathrm{rms2}(n), \quad -a_1\,\mathrm{rms2}(n), \quad a_1\,\mathrm{rms2}(n), \quad a_2\,\mathrm{rms2}(n),$$

placed symmetrically about 0. The choice of $(a_1, a_2)$ is very important for producing good quality.

Suppose that b bits represent m samples. The residual
signal transmission rate is then (b/m) x sampling rate. With a
combination of b and m, we can build a variety of voice compression systems.
For example, we can construct two different sys-8k's, where one has
(b=1, m=1) and the other has (b=2, m=2) with every other sample
punctured out to zero. These two systems are demonstrated in Figure 11.
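The rate relation can be checked with a one-line C function. This is a sketch for illustration; the name `residual_rate` is hypothetical, and integer arithmetic is used on the assumption that b times the sampling rate is divisible by m.

```c
/* Residual transmission rate in bits per second when b bits represent
   m samples at a sampling rate of fs samples per second. */
long residual_rate(long b, long m, long fs)
{
    return b * fs / m;
}
```

At 8000 samples per second, both (b=1, m=1) and (b=2, m=2) give 8000 bits per second, which is why they are two different realizations of sys-8k. Two bits for every four samples gives 4000 bits per second.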


[Figure 11: Example of two different sys-8k's, plotted as exciting amplitude vs. time. (a) sys-8k with (b=1, m=1); (b) sys-8k with (b=2, m=2; every other one punctured).]


We compared three sys-8k's: (b=1,m=1), (b=2,m=2) where every other
one is punctured out to zero, and (b=3,m=3) where two other samples
are punctured out to zeros. The rough quality judgement is shown in
table 3. The quality of (b=1,m=1) and (b=2,m=2) differs in
character, and it is not easy to conclude which one is better. The (b=2,m=2)
case seems to be good specifically for male voices, while the other is
good for female voices. When we take 80 samples as the frame length
instead of 160 samples, the (b=2,m=2) case provides better overall quality
for female and male voices. Thus, we selected the (b=2,m=2) case and tried
to optimize its system parameters, where the frame length is 160. We
ran several different utterances to find proper values of the multi-levels
for sys-8k with (b=2,m=2). Figure 12 illustrates the effect of $a_1$ and
$a_2$ on quality.

Table 3. Quality comparison

  System description    Quality
  b=1, m=1              smooth and heavy hissing noise
  b=2, m=2              clear and light electronic accent
  b=3, m=3              clear and heavy electronic accent

[Figure, upper axis: $a_1$ = 0.2 was fixed. Along the $a_2$ axis the sound
ranges from smooth with hissing noise to clear and discontinuous;
$a_2$ = 0.8 is marked as a good choice.]

[Figure, lower axis: $a_2$ = 0.8 was fixed. Along the $a_1$ axis the sound
ranges from electronic accent to hissing noise; $a_1$ = 0.2 is marked as
a good choice.]

Figure 12: Effect of $a_1$ and $a_2$ of sys-8k

For sys-4k, we tried a generalization of the punctured system
in which, instead of eliminating every other bit by puncturing, we
replace a short block sequence by another block sequence. This is
essentially a simple block compression scheme, of which punctured
systems are special cases. Our tests have shown that simple block
compression provides better quality than punctured systems, which have
a heavy electronic accent at 4 kbps residual data rates. The block
compression can be implemented by generating a code. A good code we
found is shown in Table 4, where the block length is 4 bits. Since 2
bits represent 4 samples, the transmission rate for the residual
samples is 4000 bits per second. The tree search was done taking this
block compression into account.
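
The block code can be pictured as a small look-up table. The sketch below
assumes the Table 4 entries read as listed in the array (some subscripts in
the scanned table are hard to make out) and uses the levels 0.3 and 0.65
quoted with that table; the names `CODEBOOK`, `decode_block` and
`sys4k_rate` are hypothetical and are not taken from the simulation
programs.

```c
/* Exciting levels quoted with Table 4. */
#define A1 0.30
#define A2 0.65

/* Block codebook: row = 2-bit codeword, column = sample n = 1..4.
   Entries multiply rms2(n). Subscripts are as read from Table 4. */
static const double CODEBOOK[4][4] = {
    {  A2, 0.0,  A2, 0.0 },   /* codeword 0 0 */
    { 0.0,  A2, 0.0, -A1 },   /* codeword 0 1 */
    { 0.0, -A2, 0.0,  A1 },   /* codeword 1 0 */
    { -A2, 0.0, -A2, 0.0 },   /* codeword 1 1 */
};

/* Expand one codeword (0..3) into four excitation samples.
   rms2 holds the interpolated rms2 value for each of the four samples. */
void decode_block(int codeword, const double rms2[4], double out[4])
{
    for (int n = 0; n < 4; n++)
        out[n] = CODEBOOK[codeword][n] * rms2[n];
}

/* Residual bit rate: 2 bits per 4 samples at fs samples/sec. */
double sys4k_rate(double fs)
{
    return 2.0 / 4.0 * fs;
}
```

In a tree search, each survivor node would then be extended by the four
codewords once per block rather than by $2^b$ branches per sample.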

Except for sys-4k, we employed only the punctured scheme for our voice
compression systems. For sys-12k, we have b=3 and m=2, where every
other sample is set to zero. For sys-16k, we have b=2 and m=1. Since
sys-16k with (b=2, m=1) provided good quality, we did not try any
other combination of b and m. The parameter choices for all the
systems are summarized in Section 7.

Table 4. Code for sys-4k ($a_1$ = 0.3, $a_2$ = 0.65)

Codeword   n=1             n=2             n=3             n=4
0 0        a_2 rms2(n)     0               a_2 rms2(n)     0
0 1        0               a_2 rms2(n)     0               -a_1 rms2(n)
1 0        0               -a_2 rms2(n)    0               a_1 rms2(n)
1 1        -a_2 rms2(n)    0               -a_2 rms2(n)    0

5.3 Effect of M and L



The (M,L) tree search algorithm keeps track of only the M best paths
in the populated tree. The decision on a best branch is made for the
branch L branches back in depth from the current node having the
smallest accumulated error. The number of extension branches at each
survivor node is $2^b$. For a punctured-out sample, just one branch is
populated, whose output corresponds to an input value of zero. At
each sample, we compute the accumulated errors of all extended nodes
and select the best M nodes. After deciding the best branch, we
eliminate any of the M nodes that does not share the same root, L
branches back in depth, as the current best node. Thus, there are
at most M survivor nodes. Examples of the (M=3, L=3) tree search
algorithm for sys-8k and sys-4k are illustrated in Figure 13.
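
The steps of the search can be sketched in C as follows. This is a
simplified illustration and not the simulation program: a one-pole filter
y[n] = x[n] + c*y[n-1] stands in for the cascade of biquads, punctured
samples and block codes are left out, and all names (`Path`, `ml_search`,
the `MAX*` limits) are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

#define MAXN 256   /* longest input handled by this sketch */
#define MAXM 16    /* largest number of survivors */
#define MAXB 8     /* largest number of branches per node */

typedef struct {
    double err;               /* accumulated squared error */
    double y;                 /* state of the stand-in synthesis filter */
    unsigned char sym[MAXN];  /* branch indices chosen so far */
} Path;

static int by_err(const void *p, const void *q)
{
    double a = ((const Path *)p)->err, b = ((const Path *)q)->err;
    return (a > b) - (a < b);
}

/* Writes one branch index per sample to out; returns n, or -1 on bad
   arguments. levels[nb] are the excitation amplitudes of the branches. */
int ml_search(const double *target, int n, const double *levels, int nb,
              double c, int M, int L, unsigned char *out)
{
    static Path surv[MAXM], cand[MAXM * MAXB];
    int ns = 1, nout = 0;

    if (n < 1 || n > MAXN || M < 1 || M > MAXM ||
        nb < 1 || nb > MAXB || L < 1)
        return -1;
    memset(&surv[0], 0, sizeof surv[0]);

    for (int t = 0; t < n; t++) {
        int nc = 0;
        /* extend every survivor by all nb branches */
        for (int s = 0; s < ns; s++) {
            for (int k = 0; k < nb; k++) {
                Path *p = &cand[nc++];
                double e;
                *p = surv[s];
                p->y = levels[k] + c * surv[s].y;
                e = target[t] - p->y;
                p->err += e * e;
                p->sym[t] = (unsigned char)k;
            }
        }
        /* rank by accumulated error and keep the best M nodes */
        qsort(cand, (size_t)nc, sizeof cand[0], by_err);
        int keep = nc < M ? nc : M;

        /* decide the branch L deep on the best path, then drop kept
           nodes that do not share that branch */
        int d = t - (L - 1);
        ns = 0;
        for (int i = 0; i < keep; i++) {
            if (d >= 0 && cand[i].sym[d] != cand[0].sym[d])
                continue;
            surv[ns++] = cand[i];
        }
        if (d >= 0)
            out[nout++] = cand[0].sym[d];
    }
    /* flush the undecided tail from the best surviving path */
    for (int d = nout; d < n; d++)
        out[d] = surv[0].sym[d];
    return n;
}
```

The work per sample is set by the number of survivors and branches, not by
the decision depth L, which only determines when a branch is released.
The static work arrays keep the sketch short but make it non-reentrant.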

(b) Sys-4k with the block compression of Table 4.

Figure 13: Example of (M=3,L=3) tree search algorithm

Search time and voice quality depend upon M, not L. The MASSCOMP
computer system takes around 20 minutes for a simulation of sys-8k
with M = 7 and L = 32, where a 2 second utterance is tested. The
simulation time increases exponentially with M. The difference in
quality between M = 3 and M = 5 seems much larger than that between
M = 7 and M = 9, and M = 7 already provides good quality for sys-16k,
sys-12k and sys-8k. As we can see in Figure 13, sys-4k with M = 7
takes a much shorter time than the other systems with the same M.
Thus, we took M = 9 for sys-4k.

L represents the decision depth in the tree search. The simulation
time does not actually depend upon L. L affects the smoothness of the
reconstructed speech. According to our tests, a larger value of L
gives slightly more smoothness. A proper choice is either 16 or 32 for
a sampling rate of 8000/sec; L = 32 or L = 64 might be good for a
sampling rate of 16000/sec. We take L = 32 for all the systems.

6. SPEECH RECONSTRUCTION 

The input sequence corresponding to the best outputs in the tree
search is transmitted with the biquad coefficients to the receiver. By
repeating the same process as used in the tree search algorithm, the
receiver can reproduce the sequence of the best outputs. Since the D/A
converter of the MASSCOMP accepts 12-bit quantized samples, we check
the amplitudes of the final outputs with a clipping device, shown in
Figure 14.

[Figure: clipping device applied to the output of the biquad filter]

Figure 14: Amplitude checker in speech reconstruction
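
A clipping stage of this kind might look like the sketch below. The signed
range -2048..2047 (two's complement) and the round-to-nearest step are
assumptions; the report only states that the D/A converter takes 12-bit
samples. The name `clip12` is hypothetical.

```c
/* Limits of a signed 12-bit sample (two's complement is assumed). */
#define CLIP_MAX  2047
#define CLIP_MIN (-2048)

/* Round a filter output to the nearest integer and clip it to 12 bits. */
int clip12(double y)
{
    if (y >= (double)CLIP_MAX) return CLIP_MAX;
    if (y <= (double)CLIP_MIN) return CLIP_MIN;
    return (int)(y < 0.0 ? y - 0.5 : y + 0.5);
}
```

Comparing before the conversion to `int` avoids overflow when the filter
output is far outside the converter range.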

7. SUMMARY AND DISCUSSION 

7.1 Summary

The voice compression system has been divided into 5 subsystems:
speech source, input normalization, speech analysis, tree search
algorithm, and speech reconstruction. These individual subsystems have
been investigated and their parameters have been optimized.

Table 5 lists the main symbols used in this report. Table 6
summarizes the choice of parameters that seems to be best for the
speech files we used in our optimization. The block diagram of the
simulation is sketched in Figure 15. To understand the tracking
behavior of the outputs in the time domain, we drew the tracking curves
of the 45-th frame of one of the speech files
("/usr/ee/moon/speech/spl"). They are shown in Figure 16.

We recorded the simulated voice compression results on a cassette
tape. There are 6 different types of utterances on the tape. The
recording procedure and the tested contents on the tape are described
in Part III. The quantization of the biquad coefficients given by Table
2 and Appendix 2 has been applied to all the systems except sys-16k.
Because sys-16k otherwise provides very good quality, the quantization
causes a noticeable degradation there. For the other systems, it is
difficult to recognize any degradation due to the quantization.

Roughly speaking, the input normalization works well for both weak
and strong voices, where the difference in power can be larger than 20
dB.

7.2 Discussions 

For practical usage, we have to test many types of utterances
(specifically, different pitch periods) under real conditions in order
to make a robust choice of parameters. If there are several locally
optimal parameter sets, we can implement an adaptive selection of
parameter sets in a hardware product. For example, the set for a very
clear background environment is different from that for a very noisy
environment. Even though not much improvement is expected from
changing the parameter values, they must be carefully selected.

If we can further encode the residual signal power distribution,
there might be an improvement in quality. In our system, we sent one
value of rms2 per frame. The technique used by the APC-4 system [5]
might be useful here. Vector quantization also seems to be a useful
tool for transmitting the distribution with fewer bits. However, we
tested short frame lengths, e.g. 20, 40, and 80 samples, and found that
there was no big improvement, although we could hear a difference.
Specifically, sys-8k was sensitive to the frame length.

The RELP system is known to provide good quality at higher data
rates. If there is some way to combine the RELP system with vector
quantization or with the tree search algorithm, it might be a good
candidate. However, we have no specific idea for this combination right
now.

Suppose we consider our system with a conventional LPC filter or a
lattice filter instead of the biquads. Based on some preliminary
tests, we expect similar results in quality and complexity.
The biquad has a kind of pre-emphasis/de-emphasis perceptual weighting
built in, but we cannot apply to the biquad the usual noise-shaping
technique that is used in most low-rate speech compression systems [6].

For sys-4k, we think that the system studied here works well. The
use of both the block data compression and the puncturing scheme seems
to work well. When we applied the block data compression to sys-8k, we
did not notice any difference.

For sys-8k, a frame length of 80 gives a better sound (much better
in some sense) than a length of 160. This is not true for the other
systems. Further investigation of this problem is recommended.

For sys-16k, the quantization of the biquad coefficients causes
noticeable degradation. One way to reduce this loss is to rearrange
the upper and lower bounds of the $k_1$'s and $k_2$'s in Table 1 so
that we have a smaller quantization step size.

The (M,L) tree search is the most time-consuming part of our
system. An efficient implementation of the searching process (rather
than the brute-force method) would help in building a real-time
system.

To achieve less complexity with the same quality, we might use the
rms2 of the residual signal from the speech analysis rather than adding
the filtering process for the rms2 computation in the tree search
algorithm (see Figure 16). A modification of the parameter values will
be needed.


Table 5. Symbol list

Symbol          Description                          Remarks
s(n)            Original speech sample               0 <= n < FRAME
rms1            R.m.s. of s(n)
rms1(n)         n-th interpolated rms1
s'(n)           Normalized sample of s(n)
H_j(Z)          Transfer function of j-th biquad     1 <= j <= 4
H(Z)            H_1(Z)H_2(Z)H_3(Z)H_4(Z)
r(n)            Residual signal of H^{-1}(Z)
k_i(n)          Coefficients of a biquad             i = 1, 2
\bar{k}_i       Averaged k_i(n) over a frame
\bar{k}_i(n)    n-th interpolated \bar{k}_i
s_i(n)          Biquad sensitivity w.r.t. k_i(n)
u               Gain factor in the update formula
r               Clipping threshold of r(n)
rms2            R.m.s. of the residual signal
rms2(n)         n-th interpolated rms2
b               Bits representing a residual sample
a_i             Exciting level                       1 <= i <= 2^(b-1)







Table 6. System parameters

System      a_4       a_3       a_2       a_1       u       [illeg.]  [illeg.]  M         L
Sys-4k (#)  [illeg.]  [illeg.]  0.65      [illeg.]  0.02    [illeg.]  [illeg.]  9         32
Sys-8k      [illeg.]  [illeg.]  0.8       0.2       0.02    1.0       0.5       [illeg.]  32
Sys-12k     [illeg.]  0.8       [illeg.]  [illeg.]  0.025   1.2       0.6       [illeg.]  32
Sys-16k     [illeg.]  [illeg.]  [illeg.]  [illeg.]  0.025   1.2       0.6       5         32

# : the look-up table is shown in Table 4.
[illeg.] : entry not legible in the scanned source.

Figure 15: Block diagram of the simulation

Figure 16: Tracking curves

Appendix 1: Description of Speech Files


This appendix describes a set of files of six speech utterances and
their pitch estimates generated by Professor T. Barnwell of Georgia Tech. The
MASSCOMP computer stores the utterances and the pitch estimates under the file
directory /usr/spc_smp/, where the utterances are labeled S1, S2, ..., S6
and the pitch estimates are labeled PP1, PP2, ..., PP6.

Speech Data Base

A set of files of speech utterances is labeled S1, S2, ..., S6. Each
file contains 24,576 12-bit samples taken at a sampling rate of
8000 samples/sec. Each 12-bit sample is stored in a 16-bit integer word.
Waveform plots of these utterances are attached.

The files PP1, PP2, ..., PP6 contain accurate estimates of pitch for
files S1, S2, ..., S6 respectively. The estimates are obtained every 10 msec,
i.e., every 80 samples of the waveform. The 307 pitch estimates are the first
307 numbers in the file. The remaining numbers are zero.

The numbers in the pitch files are the period of the speech waveform in
samples, where the sampling rate is 8000 samples/sec. A zero pitch period
indicates unvoiced speech. Plots and listings of the pitch files are
attached.
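
For orientation, the conventions above translate into a few lines of C.
This is only a sketch: the helper names are hypothetical, and holding the
pitch values in `short` is an assumption (the appendix states the word size
for the speech samples only).

```c
#define FS_HZ 8000.0   /* sampling rate of the speech files */

/* Pitch frequency in Hz from a period in samples; 0.0 marks unvoiced. */
double pitch_hz(int period)
{
    return period > 0 ? FS_HZ / (double)period : 0.0;
}

/* Count the voiced frames among the first n pitch estimates. */
int count_voiced(const short *pp, int n)
{
    int voiced = 0;
    for (int i = 0; i < n; i++)
        if (pp[i] > 0)
            voiced++;
    return voiced;
}

/* Duration in seconds of nsamples samples. */
double duration_sec(long nsamples)
{
    return (double)nsamples / FS_HZ;
}
```

A file of 24,576 samples lasts 3.072 seconds, which matches 307 pitch
estimates spaced 10 msec apart.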

Catalog of Utterances

S1: "The pipe began to rust while new" (female speaker)

S2: "Thieves who rob friends deserve jail" (male speaker)

S3: "Add the sum to the product of these three" (female speaker)

S4: "Open the crate but don't break the glass" (male speaker)

S5: "Oak is strong and also gives shade" (male speaker)

S6: "Cats and dogs each hate the other" (male speaker)

Appendix 2: Quantization of the Biquad Coefficients


An optimal quantization of the biquad coefficients is discussed in this
appendix. This analysis shows an optimal bit allocation based on the
minimization of maximum spectral error. The bit allocation derived here is
used in our simulations.

Quantization of the Biquads Coefficients

I. Introduction

Let H(Z) = transfer function of the biquads with coefficients $\{k_i(n)\}$,
where i = 1, 2 and n denotes the stage of the biquads. Instead of mean
squared error, we employ the average of the area of difference between two log
spectra as a measure:

$$ \Delta S = \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\, \log|H(e^{j\omega})|^2 - \log|\tilde{H}(e^{j\omega})|^2 \,\right| d\omega \tag{1} $$

where $\tilde{H}(Z)$ = transfer function with a perturbation in a particular
coefficient $k_i(n)$ (for example, $k_1(2)+\Delta k_1(2)$). The spectral
sensitivity with respect to $k_i(n)$ is defined by

$$ \frac{\partial S}{\partial k_i(n)} = \lim_{\Delta k_i(n)\to 0} \frac{1}{|\Delta k_i(n)|}\,\frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\, \log|H(e^{j\omega})|^2 - \log|\tilde{H}(e^{j\omega})|^2 \,\right| d\omega \tag{2} $$

It has been shown that the spectral sensitivity is a good measure for
judging a quantization scheme for coefficients in linear predictive systems
[1]. This appendix investigates the quantization properties of the biquads
coefficients and derives a procedure for the bit allocation by minimizing the
maximum spectral error.


II. Spectral Sensitivity 

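For simplicity, we take the 4 staged biquads. Then

$$ \frac{1}{H(Z)} = \frac{1}{\prod_{n=1}^{4} k_1^2(n)} \prod_{n=1}^{4} A_n(Z) $$

and

$$ A_n(Z) = 1 - [\,2 - k_1(n)k_2(n) - k_1^2(n)\,]Z^{-1} + [\,1 - k_1(n)k_2(n)\,]Z^{-2} \tag{6} $$

The spectral sensitivity for our system can be written as follows, where
$A(Z) = \prod_{m=1}^{4} A_m(Z)$ and a tilde marks the perturbed polynomial.

$$
\begin{aligned}
\frac{\partial S}{\partial k_i(n)}
 &= \lim_{\Delta k_i(n)\to 0} \frac{1}{|\Delta k_i(n)|}\,\frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\, \log|A(e^{j\omega})|^2 - \log|\tilde{A}(e^{j\omega})|^2 \,\right| d\omega \\
 &= \lim_{\Delta k_i(n)\to 0} \frac{1}{|\Delta k_i(n)|}\,\frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\, \sum_{m=1}^{4} \left( \log|A_m(e^{j\omega})|^2 - \log|\tilde{A}_m(e^{j\omega})|^2 \right) \right| d\omega \\
 &= \lim_{\Delta k_i(n)\to 0} \frac{1}{2\pi}\int_{-\pi}^{\pi} \frac{1}{|\Delta k_i(n)|} \left|\, \log|A_n(e^{j\omega})|^2 - \log|\tilde{A}_n(e^{j\omega})|^2 \,\right| d\omega \\
 &= \frac{1}{2\pi}\int_{-\pi}^{\pi} \left|\, \frac{\partial}{\partial k_i(n)} \log|A_n(e^{j\omega})|^2 \,\right| d\omega \\
 &= \frac{1}{2}\int_{-1}^{1} |\sigma_i(n,\theta)|\, d\theta
\end{aligned}
\tag{7}
$$

where

$$ \sigma_i(n,\theta) = \frac{1}{|A_n(e^{j\pi\theta})|^2}\,\frac{\partial}{\partial k_i(n)} |A_n(e^{j\pi\theta})|^2 \tag{8} $$

and

$$ |A_n(e^{j\pi\theta})|^2 = \left[\, k_1^2(n) - 2\,(2 - k_1(n)k_2(n)) \sin^2\frac{\pi\theta}{2} \,\right]^2 + \left[\, k_1(n)k_2(n) \sin \pi\theta \,\right]^2 \tag{9} $$

The elimination of the summation in the above derivation is due to the fact
that $|\tilde{A}(e^{j\omega})|^2$ has the same coefficients $\{k_i(n)\}$ as
$|A(e^{j\omega})|^2$ except for a particular single $k_i(n)$. The biquad is
assumed to be stable, i.e., the zeros of $A_n(Z)$ lie within the unit circle
(not on the unit circle). Thus, $\left|\log|A_n(e^{j\omega})|^2\right|$ is
bounded. Therefore we can take the derivative.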

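To compute $\frac{\partial S}{\partial k_i(n)} = \frac{1}{2}\int_{-1}^{1} |\sigma_i(n,\theta)|\, d\theta$, we use Gauss' formula, i.e.,

$$ \int_{-1}^{1} f(x)\, dx \approx \sum_{m=1}^{L} w_m f(x_m) \tag{10} $$

where, for a fixed L, $w_m$ and $x_m$ are given for m = 1, 2, ..., L.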
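
As an illustration of (10), the sketch below applies the standard 5-point
Gauss-Legendre rule on [-1, 1]. The order L = 5 is an assumption (the
appendix does not state which L was used), and `gauss5` and `quartic` are
hypothetical names; the L here is the quadrature order, not the tree search
depth.

```c
/* 5-point Gauss-Legendre nodes x_m and weights w_m on [-1, 1]. */
static const double GL_X[5] = {
    -0.9061798459386640, -0.5384693101056831, 0.0,
     0.5384693101056831,  0.9061798459386640
};
static const double GL_W[5] = {
    0.2369268850561891, 0.4786286704993665, 0.5688888888888889,
    0.4786286704993665, 0.2369268850561891
};

/* Approximate the integral of f over [-1, 1] as sum_m w_m f(x_m).
   ctx is passed through to f so that it can carry parameters. */
double gauss5(double (*f)(double, const void *), const void *ctx)
{
    double sum = 0.0;
    for (int m = 0; m < 5; m++)
        sum += GL_W[m] * f(GL_X[m], ctx);
    return sum;
}

/* Example integrand: x^4, whose exact integral over [-1, 1] is 2/5. */
double quartic(double x, const void *ctx)
{
    (void)ctx;
    return x * x * x * x;
}
```

A 5-point rule is exact for polynomials up to degree 9. The sensitivity
integrand contains an absolute value, so in practice a higher order or a
split of the interval may be needed near its sign changes.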


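Directly from (8) and (9), we have

i) for $k_1(n)$, n = 0, 1, 2, 3

$$ \sigma_1(n,x) = \frac{2\left[k_1^2(n) - (4 - 2k_1(n)k_2(n))\sin^2\frac{\pi x}{2}\right]\left(2k_1(n) + 2k_2(n)\sin^2\frac{\pi x}{2}\right) + 2k_1(n)k_2^2(n)\sin^2 \pi x}{\left[k_1^2(n) - (4 - 2k_1(n)k_2(n))\sin^2\frac{\pi x}{2}\right]^2 + \left[k_1(n)k_2(n)\sin \pi x\right]^2} \tag{11} $$

ii) for $k_2(n)$, n = 0, 1, 2, 3

$$ \sigma_2(n,x) = \frac{2\left[k_1^2(n) - (4 - 2k_1(n)k_2(n))\sin^2\frac{\pi x}{2}\right] 2k_1(n)\sin^2\frac{\pi x}{2} + 2k_1^2(n)k_2(n)\sin^2 \pi x}{\left[k_1^2(n) - (4 - 2k_1(n)k_2(n))\sin^2\frac{\pi x}{2}\right]^2 + \left[k_1(n)k_2(n)\sin \pi x\right]^2} \tag{12} $$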

The spectral sensitivity for a particular $k_i(n)$ does, in general, depend on
the values of the other coefficients. A useful choice is the simple average
of the sensitivity over many different sets of coefficients from a large
number of different speech sounds:

$$ \overline{\frac{\partial S}{\partial k_i(n)}} = \frac{1}{T}\sum_{t=1}^{T} \frac{\partial S}{\partial k_i(n,t)} \tag{13} $$

Figure 1 shows the averaged sensitivity of the 4 staged biquads. The
averaging was conducted over 10 sets of different coefficients (5 voiced, 5
unvoiced) from the sample speech S1 ("The pipe began ..."). In Figure 1, the
smoothed values form curves such that the exact sensitivity lies within
±1 dB of the respective curve. The curves cover the practical ranges of each
$k_i(n)$ for the sample speech S1. It can be noticed that the reconstructed
speech quality is more sensitive to the quantization error around
lower values of $k_1(n)$, n = 0, 1, 2 and 3, while the sensitivity of
$k_2(n)$ is more uniform.

III. Quantization Scheme

We define the optimal quantization as a quantization which provides a 
flat spectral sensitivity. Thus, the search for the optimal quantization 
scheme reduces to the search for a nonlinear transform that results in a flat 
spectral sensitivity, and then we employ the linear quantization for the 
transformed coefficients. 

Let f(·) be the nonlinear transform such that

$$ g = f(k), \qquad k \in \{k_i(n)\}. $$

Since $\partial S/\partial g$ is a constant for the optimality, we have

$$ \frac{\partial S}{\partial g} = \frac{\partial S}{\partial k}\,\frac{\partial k}{\partial g} = c \ \text{(constant)} \tag{14} $$

Thus,

$$ \frac{df}{dk} = \frac{1}{c}\,\frac{\partial S}{\partial k} \tag{15} $$

If the expression of $\partial S/\partial k$ is given, we can obtain f(·) by
integration. The sensitivity curves for $k_1(0)$ and $k_1(1)$ in Figure 1 can
be approximately represented, up to a constant scale factor, by

$$ \frac{\partial S}{\partial k_1(n)} \propto \frac{1}{1 - (k_1(n) - p)^2}, \qquad p = 0.85, \quad 0.1 \le k_1(n) \le 0.8, \quad n = 0 \text{ and } 1 \tag{16} $$

By (15), we obtain

$$ f(k_1(n)) = \log_{10}\frac{(1-p) + k_1(n)}{(1+p) - k_1(n)} = \log_{10}\frac{0.15 + k_1(n)}{1.85 - k_1(n)}, \qquad 0.1 \le k_1(n) \le 0.8 \tag{17} $$

Figure 3 shows a plot of f(·). We have also plotted a line that provides
close values over $0.1 \le k_1(n) \le 0.8$. Therefore, in practice, we could
linearly quantize $k_1(0)$ and $k_1(1)$, as well as the other $\{k_i(n)\}$,
to obtain approximately flat sensitivity characteristics.

IV. Bit Allocation 

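We derive a procedure for binary bit allocation by minimizing the
maximum spectral deviation. Let

M = the total number of bits for quantization

$\{q_i,\ i = 1, 2, \ldots, p\}$ = set of coefficients to be quantized

$N_i = 2^{M_i}$ : number of levels for coefficient $q_i$

$\delta_i = (\bar{q}_i - \underline{q}_i)/N_i$ : quantization step size,

where

$\bar{q}_i$ = upper bound of $q_i$
$\underline{q}_i$ = lower bound of $q_i$

For the linear quantization of $q_i$ using round-off arithmetic, the maximum
quantization error is

$$ |\Delta q_i|_{\max} = \frac{1}{2}\,\delta_i \tag{18} $$

The maximum total spectral deviation $(\Delta S)_{\max}$ is given by

$$ (\Delta S)_{\max} = \sum_{i=1}^{p} \frac{\partial S}{\partial q_i}\,|\Delta q_i|_{\max} = \sum_{i=1}^{p} \frac{K_i}{N_i} \tag{19} $$

where

$$ K_i = \frac{1}{2}\,\frac{\partial S}{\partial q_i}\,(\bar{q}_i - \underline{q}_i), \qquad 1 \le i \le p $$

The problem is to find $N_i$, i = 1, 2, ..., p, minimizing $(\Delta S)_{\max}$
subject to the constraint $\sum_{i=1}^{p} \log_2 N_i = M$. The solution is
given by [1],

$$ N_1 = \left[ \frac{2^M}{\prod_{i=1}^{p} (K_i/K_1)} \right]^{1/p} \tag{20} $$

and

$$ N_i = \frac{K_i}{K_1}\, N_1, \qquad 2 \le i \le p \tag{21} $$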



For example, we use the 4-staged biquads, and can make a numerical table
as follows. Here, we have 26 bits for quantization of the biquads'
coefficients.

Table 1. Bit allocation with M=26 for sample speech S1

                     k_1(0)    k_1(1)  k_1(2)  k_1(3)  k_2(0)  k_2(1)    k_2(2)  k_2(3)
Upper bound          0.8       0.8     1.6     1.9     1.2     1.4       0.4     0.14
Lower bound          0.1       0.1     0.8     1.6     0.02    0.2       0.02    0.02
dS/dk_i(n)           3.46      3.16    2.8     2.6     0.48    [illeg.]  2.3     4
K_i(n)               1.211     1.106   1.12    0.39    0.28    0.54      0.44    0.24
K_i(n)/K_1(0)        1         0.91    0.92    0.32    0.23    0.45      0.36    [illeg.]
N_i(n), before
truncation           [illeg.]  18.8    19      6.6     4.75    9.3       7.44    4.13

[illeg.] : entry not legible in the scanned source.

After truncation of $N_i(n)$ and rearrangement of the M bits, we obtain the
bit allocation for our system for speech S1, as shown in Table 2.

Table 2. Bit allocation for coefficients

Coefficient        Bits
rms of residual    5
k_1(0)             4
k_2(0)             3
k_1(1)             4
k_2(1)             3
k_1(2)             4
k_2(2)             3
k_1(3)             3
k_2(3)             2
Total              30 bits


V. Computational Procedure 

1. Using (10), compute $\partial S/\partial k_i(n)$ for different sets of
coefficients from many different speech sounds, and take the average by (13).

2. Using (15), compute the nonlinear transform f(·) and apply the linear
quantization scheme to the transformed coefficients.

3. As shown in Table 1, compute the bit allocation.

Remark:

This report has considered the quantization properties of the biquads
coefficients, and concluded that i) we could apply the linear quantization
directly to $\{k_i(n)\}$ and ii) we have the bit allocation for speech
S1, as shown in Table 2.

Reference:

[1] R. Viswanathan and J. Makhoul, "Quantization Properties of Transmission
Parameters in Linear Predictive Systems," IEEE Trans. on ASSP, Vol. 23, June,

Appendix 3: List of quantized biquad coefficients

This appendix shows the list of quantized biquad coefficients of
sys-4k, where the utterance file /usr/ee/moon/speech/spl was used.
ki[j] denotes the coefficient $k_i$ of the j-th biquad.

PART III

AUDIO TAPE OF SIMULATIONS

The simulated results were recorded on an audio cassette tape.
There are eight different types of utterances on the tape. Each of
the following types of utterance is repeated several times:

1) "The pipe began to rust while new" (female speaker) 

2) "Thieves who rob friends deserve jail" (male speaker) 

3) "Add the sum to the product of these three" (female speaker) 

4) "Open the crate but don't break the glass" (male speaker) 

5) "Oak is strong and also gives shade" (male speaker) 

6) "Cats and dogs each hate the other" (male speaker) 

These six utterances were recorded in a clear background environment.
The next two types have strong background interference. In type
(7) there is another background voice, while in (8) there is a white
noise background.

7) "The pipe began to rust while new" (female speaker)

8) "Cats and dogs each hate the other" (male speaker)

For each type, the recording order is the original utterance, the
output of sys-8k, and the output of sys-4k, where each utterance is
repeated twice. In all cases except the original, quantization of the
biquad coefficients has been applied.